Document dissimilarity within and across languages: A benchmarking study

Authors

  • Richard S. Forsyth
  • Serge Sharoff
Abstract

Quantifying the similarity or dissimilarity between documents is an important task in authorship attribution, information retrieval, plagiarism detection, text mining, and many other areas of linguistic computing. Numerous similarity indices have been devised and used, but relatively little attention has been paid to calibrating such indices against externally imposed standards, mainly because of the difficulty of establishing agreed reference levels of inter-text similarity. The present article introduces a multi-register corpus gathered for this purpose, in which each text has been located in a similarity space based on ratings by human readers. This provides a resource for testing similarity measures derived from computational text-processing against reference levels derived from human judgement, i.e. external to the texts themselves. We describe the results of a benchmarking study in five different languages in which some widely used measures perform comparatively poorly. In particular, several alternative correlational measures (Pearson r, Spearman rho, tetrachoric correlation) consistently outperform cosine similarity on our data. A method of using what we call ‘anchor texts’ to extend this method from monolingual inter-text similarity-scoring to inter-text similarity-scoring across languages is also proposed and tested.
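
To make the comparison named in the abstract concrete, the following is a minimal sketch of three of the measures mentioned (cosine similarity, Pearson r, Spearman rho) computed on word-count vectors. It is not the authors' implementation: the whitespace tokenisation, the shared-vocabulary vectors, and the helper names term_vectors, cosine, pearson, ranks and spearman are all assumptions made for illustration, and tetrachoric correlation is left out.

```python
import math
from collections import Counter


def term_vectors(doc_a, doc_b):
    """Count whitespace-separated tokens over the union vocabulary (an assumed, simplistic tokenisation)."""
    ca, cb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    vocab = sorted(set(ca) | set(cb))
    return [ca[w] for w in vocab], [cb[w] for w in vocab]


def cosine(x, y):
    """Cosine of the angle between two vectors; raises ZeroDivisionError for an all-zero vector."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson r, computed as the cosine of the mean-centred vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho: Pearson r applied to the rank-transformed vectors."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    x, y = [1, 2, 3], [3, 2, 1]
    print("cosine  ", cosine(x, y))
    print("pearson ", pearson(x, y))
    print("spearman", spearman(x, y))
```

On the toy vectors in the usage lines, cosine stays well above zero because count vectors are non-negative, while the two correlational measures report a perfect inverse relationship. This only illustrates how the measures can disagree; it says nothing about the benchmark results reported in the article.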


Similar resources

Benchmarking Professional Ethics Documents at Universities; Providing the Pattern and Solutions

This research aims to benchmark (national and international) experiences in the field of professional ethics documents at universities and provides solutions to improve it. The study population consisted of 100 top universities in the world (based on the Shanghai ranking) and 539 universities in Iran. Of these, 59 universities in Iran and 100 universities abroad have professional ethics documen...


Perceptual Learning Style Preferences and Computer-Assisted Writing Achievement within the Activity Theory Framework

Learning styles are considered among the significant factors that aid instructors in deciding how well their students learn a second or foreign language (Oxford, 2003). Although this issue has been accepted broadly in educational psychology,further research is required to examine the relationship between learning styles and language learning skills. Thus, the present study was carried out to in...


Benchmarking Records of Managers and Officers in Charge at the Health Deputy Headquarters of Iranian Universities of Medical Sciences, 1386-1388 (Solar Hijri)

 Introduction: Benchmarking is used to identify the successful experiences and achievements of a business to develop and improve organizational performance. This study aimed to determine, firstly, the frequency of benchmarking made by administrators and officers at Health Deputy headquarters of Iranian universities of medical sciences and, secondly, the relationship of this frequency to individ...


Combining Subword and State-level Dissimilarity Measures for Improved Spoken Term Detection in NTCIR-11 SpokenQuery&Doc Task

In recent years, demands for distributing or searching multimedia contents are rapidly increasing and more effective method for multimedia information retrieval is desirable. In the studies on spoken document retrieval systems, much research has been presented focusing on the task of spoken term detection (STD), which locates a given search term in a large set of spoken documents. Recently, in ...



Journal:
  • LLC

Volume 29  Issue 

Pages  -

Publication date 2014